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tion results, thus demonstrating correctness. For case studies, we compared actual implemented 
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rate the probes were in discovering actual network topology. Finally, we conducted real-world 
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Executive Summary 





Since the inception of the Advanced Research Projects Agency Network (ARPANet), whose 
offspring is the Internet, network measurement has been recognized as an important task as 
measurement would be critical for future development, evolution and deployment planning.In 
this research, we propose measures that intuitively indicate the magnitude of the change be- 
tween two networks (1.e., Internet topologies). These measures are both computationally fast 
and scalable, allowing analysis of Internet-size networks. The measures that this thesis intro- 
duces effectively capture both the structure and the change of the network in time. By applying 
these measures to the Internet, we get to observe possible indicators of events that influence 
structural changes on a portion of the Internet. We are also able to compare the accuracy of 
network topology discovered by probes versus the actual implemented topology. Our tests on 
datasets of three consecutive years for Egypt and Libya show that the measure picks up the 
disruptions to Internet communications during the Arab Spring revolution in the first months 
of 2011. During this period of disruption, the measure exceeded 40%, as compared to regular 


times (i.e., periods when there was no civil unrest) when it was below 5%. 
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CHAPTER 1: 


Introduction 





The advent of the Internet has brought about the unprecedented revolution in the world of com- 
puters and communications. Today, the Internet provides the means for individuals and their 


computers to interact without physical boundaries, regardless of geographic location. 


From a mathematical standpoint, topology studies the basic properties of space and its behavior 
under continuous deformations such as stretching and bending. For our research, we extend 
this study into the realm of the Internet by concerning ourselves with the connectivity changes 


between different computers. 


The Internet is comprised of routers, where each router has multiple interfaces that connect to 
other routers. As traffic arrives, routers perform a lookup operation in order to forward the 
traffic along to the next router closer to the final destination. Thus, “connectivity” in this thesis 
concerns the connections between these interfaces, but not necessarily the physical wires that 
connect these routers. This connectivity connects companies in cooperative competition, and 
thus, routing decisions are influenced by economics. Users do not have the ability to influence 
the path that their packets take. As such, the portion of the visible, logical Internet topology is 
generally dictated by packet routing, which is in turn economically driven. 


We study this topology by abstracting the Internet into a graph, G(E,V), where E represents the 
set of edges and V the set of vertices. The set of vertices represent the interfaces of the routers; 
and edges represent the logical inter-connection between these interfaces. 


1.1 Why Measure the Internet? 


Since the inception of the ARPANet, whose offspring is the Internet, network measurement has 
been recognized as an important task as: measurement would be critical for future development, 


evolution and deployment planning [1]. 


Now, myriad people and organizations have different reasons for measuring the Internet. These 


interests are usually commercial, social or technical in nature. 


Commercially, the ability to sell or provide information about a product to a large number 


of people requires a variety of Internet measurements. Such information could include the 


reach of the Internet and the number of households with connectivity in a given area. Some 
measurements are so important that companies have been created to provide such information. 
In fact, mapping the Internet is the core business of Content Delivery Network (CDN) operators 
such as Akamai [2]. A CDN is an infrastructure for efficient delivery of Web-related content to 
Internet users, and it consists of a large distributed system of servers that are deployed across the 
Internet. According to [3], Akamai alone serves 15 to 20% of all Web traffic. A CDN service is 
distinguished by its ability to supply on-demand capacity to content providers, and its ability to 
improve performance and due to its physical proximity to the user, improve the user’s access to 


content. These factors that distinguish a CDN obviously depend on Internet measurements. 


Socially, such measurements can provide insights into popular issues. Stemming from its wide- 
spread use, governments and research institutions may also desire such information to study 
social implications resulting from Internet use. From [4], we observe that Internet communi- 
cations were purposely disrupted in response to civilian protest and threats of civil war. An 
Internet blackout was imposed by the government who deemed that other forms of censorship 
during Arab Spring were either impractical or ineffective. Another study [5] assessed that many 
authoritarian states, concerned with the power of new technologies to catalyze political change, 
have taken various measures to filter, monitor or otherwise obstruct free speech online. Thus, 
governments have been known to use selective censorship to restrict Internet traffic for political, 


religious and cultural reasons. 


Technically, Internet measurements help us to understand and influence the design of network 
components and protocols that drive the very nature of the Internet. Understanding the network 
topology helps in identifying areas with possible performance problems and provides insight 


into how applications can adapt. 


1.2 Why Is the Internet Hard To Measure? 


In [6], Floyd et al. point out that the Internet is difficult to measure due to its scale, heterogeneity, 
and dynamics. The Internet is like an enormous living organism. Measuring the Internet is also 
problematic in that the Internet is itself changing. The number of components in the Internet 
is constantly growing due to its design for performance, redundancy and availability of sub- 
networks. Its complexity and dynamism are such that it is impossible to guarantee that packets 
sent one after another would take the exact same path. Regardless of the quality of measures 


obtained, these measures may not apply elsewhere or at another point in time. 


Technical, social and political factors also inhibit our ability to obtain measurements. Internet 
devices do not always provide appropriate measurements useful for understanding the network. 
Useful measures are made inaccessible due to the architecture of the Internet itself. For example, 
routing changes can increase delay or decrease available bandwidth; route flapping can cause 
packet reordering; and overloaded links or deficient router queue implementations can cause 
packet loss due to congestion [7]. Collection of information on the Internet usually results in 
datasets that are difficult to store, transfer, process and analyze. For example, the Cooperative 
Association of Internet Data Analysis (CAIDA) dataset [8] used in our case studies, as collected 
by one team averages ~4.4 Gigabyte (GB) for each cycle of probing. There are about 165 
cycles of probing for a year and three teams that conduct the probing. All in all, a year’s worth 
of probing would result in a dataset that amounts to about 2.2 Terabyte (TB). To exacerbate 
matters, converting this dataset originally in binary format into textual form for analysis would 


increase its size even further. 


Commercial service providers often retain proprietary information and do not share details of 
their networks, as the network’s configuration and topology can contain personal and business 
data that can violate privacy and raise security concerns. The introduction of the World Wide 
Web and the privatization of the Internet left no framework for adequate tracking and monitoring 
of the Internet. Internet Service Providers (ISPs) are profit driven and are more interested in 
meeting the demands of their rapidly growing customer base, than in gathering and analyzing 


network performance data. 


1.3. Research Question 


The Internet is an interconnected system of networks that connects computers around the world. 
How can we measure the extent of these changes in interconnectivity for such a complex and 


large-scale network? 


Given the size of the Internet, the measure needs to be intuitive, fast to compute and scaleable 
(1.e., capable of comparing networks of various sizes). By intuitive, we mean that the com- 
putation of the measure captures the change in the network, and the result from the measure 


indicates if the change was small or large. Thus, the measure’s result would be a scale. 


Datasets of information collected on the Internet can be used as a ready source of data. The 


measure can then be applied on these data to give intuitive results. 


1.4 Results 

We proposed innovative measures and applied it in several ways. We first compared against 
ground truth, which was obtained from a configuration file of a large university network as 
detailed in Section 5.3, with the same topology as inferred from CAIDA’s datasets. This com- 
parison against ground-truth was performed to find the difference between the actual topology 
and that which was discovered by probes. We then applied the measures to specific portions of 
the Internet as a case study, from which we identified events that influenced Internet connectiv- 
ity changes in Egypt and Libya during the Arab Spring. This is an indication that our measures 


could pick up signals of social events. 


1.5 Organization of Thesis 
In investigating the research question, this thesis is organized as follows: 


e Chapter 1 discusses the motivation of the research. 


Chapter 2 discusses prior and related work in the fields of measuring the Internet. 


Chapter 3 introduces the machinery used in the analysis conducted in the course of the 


research. 
Chapter 4 details the data used and the methodology that was taken. 


Chapter 5 contains the results of case studies conducted using our measures. 


Chapter 6 contains the summary and discusses possible areas for future work. 





CHAPTER: 2 
Background 





In this chapter, we introduce the reader to Internet topology. We take a look at existing measures 
and also put forward our measurements. By measurements, we mean performing analysis on 


data that has been collected on the Internet and producing a reading as a result. 


2.1 Overview of the Internet 

The Internet is an extremely large and complex network. The Internet works on the Internet 
protocol suite, which consists of a networking model and a set of communications protocols. 
The addressing scheme for this protocol suite is the Internet Protocol (IP) address. Routing is 


the process by which two nodes in the network find a path to each other. 


The Internet can be imagined as a collection of individual “communities” called Autonomous 
Systems (ASes), such as large companies and ISPs, and the physical infrastructure that con- 
nects ASes is referred to as backbones. The backbones connect ASes, and thus, ASes that are 
connected (i.e., adjacent), can communicate with each other. An AS is a network or a group 
of networks under a single administrative domain. ASes have a unique routing policy for their 
networks. Each AS is identified by a unique 32-bit number. Treating traffic flowing within an 
AS as internal, ASes help to draw the boundaries between external and internal routing. Routers 
(e.g., R11, R12, R21 and R31 in Figure 2.1) are devices that forward data between networks. A 
router is connected, via its interfaces, to two or more data lines from different networks. Each 
router builds up its routing table, which is essentially a list of preferred routes between any two 
destination addresses on interconnected networks. With multiple routers on a network, these 
routers can exchange information about destination addresses using routing protocols. Internal 
routing protocols are responsible for routing inside ASes, and likewise, external routing pro- 
tocols govern routing outside of ASes. External routers see ASes as a group of a few large 
networks. Internal subnetting (i.e., dividing a network into two or more networks) within an 
AS is transparent to external routers, and external routing information is not affected by internal 
routing information. As such, an AS, say a large company, can subdivide its networks according 
to its needs, without affecting its external routing rules. Thus ASes help to reduce the load of 
routing by reducing the number of entries of networks in its routing table and by delegating part 


of the responsibility of routing internally to the company possessing that network. 


A simplified representation of the Internet is shown in Figure 2.1 where the ASes are identi- 
fied by their respective Autonomous System Numbers (ASNs). In this illustration, backbones 
connect ASN 1 to ASNs 2 and 3. 


R21 ASN 2 


< R22 6m 





ASN 3 


Figure 2.1: The Internet simplified. 


2.2 Internet Topology 

Internet topology deals with finding the topological structure of the Internet. It is daunting 
to map the entire hierarchy of the Internet due to its sheer size and also the rate at which the 
Internet is growing and evolving. The effort to map the Internet is usually incomplete and out 
of date the moment it appears. 


2.2.1 Levels of Internet Topology 

There are different granularities at which we can study the topology of the Internet as we detail 
in this section. For each level of granularity, we represent the network according to how its 
components are connected and its corresponding graph representation(s). Among these levels 
of granularity mentioned, interface-level mapping offers the greatest amount of information 


with regard to connectivity. Our research limits itself to interface-level mapping, as detailed in 


the next paragraph, and the connectivity of certain segments of the Internet is what we will be 
investigating in the following chapters. 


AS-level topology. At the AS level, routers and subnets of a single AS are represented as one 
single node; and the adjacency relations between the ASes are represented by edges between the 
nodes. These relations can be generalized into provider-customer and peer relations [9]. A cus- 
tomer pays its provider for connectivity to the Internet outside of its administrative domain. A 
pair of peers agrees to exchange traffic between their respective customers free of charge. Thus, 
interdomain routing policies are constrained by contractual commercial agreements between 


administrative domains. The AS-level representation of Figure 2.1 is shown in Figure 2.2. 











(a) Network map. (b) Graph. 


Figure 2.2: AS-level representations. 


Subnet-level topology. Subnet-level mapping [10] involves discovering the IP addresses that 
are hosted on the same subnet. A subnet is defined by the set of interfaces that it accommodates. 
Representing a subnet as a vertex in a graph as shown in Figure 2.3b, a link between two subnets 


would represent the router that directly connects these two subnets to each other. 


Interface-level topology. Interface-level mapping expands on subnet-level mapping in that 
interfaces of routers and end hosts are described as well. Here, an interface is represented by 
a node and each direct connection between a pair of nodes is represented by a link or an edge. 
Recall that a router has more than one interface, so more than one graph might be required to 
capture these interface connections between routers. Figure 2.4 illustrates some of these graphs 


as seen from the perspective of different end points. 


ASN 2 


R31 


ar 


ASN 3 





(a) Network map. (b) Graph. 


Figure 2.3: Subnet-level representations. 


Router-level topology. Since a router can have more than one interface, each of which can have 
a different IP address, these interfaces within the same router can be determined by IP Alias 
Resolution? and grouped as one node. Thus, connections between node-pairs, or equivalent 
router-pairs, are represented by the links. Topology that is inferred from these connection- 
maps more closely resembles what is on the ground as the devices are physically routed in this 
manner. However, the information required for IP Alias Resolution is not easily obtainable, and 


techniques to do so are still under research [11-14]. 


2.2.2 Obtaining Network Topology 

There has been much research in the area of Network Topology Capture (NTC) algorithms. 
In [15], adaptive probing techniques were proposed. These techniques leveraged external 
knowledge and data from prior cycle(s) to guide the selection of probed destinations and as- 
signment of destinations to vantage points. By exploiting structural knowledge of the network, 
measurement cost was reduced. Many NTCs aim to capture network dynamics with a reduced 
discovery budget (i.e., number of probes), while maintaining a broad coverage of the network 


topology at the same time [16]. 


However, the results obtained from NTCs remain incomplete or incorrect due to inherent lim- 
itations in the collection. Many systems may be offline at the time when the probing was 


performed. Also, network obstacles, such as firewalls and load balancers, often prevent probes 





The process of identifying IP addresses belonging to the same router. 


R12 R31 





(a) Network map. 


Sve 


(b) Graph of interfaces as seen (c) Graph of interfaces as seen (d) Graph of interfaces as seen 
from X. from Y. from Z. 


(e) Graph of interfaces as seen  (f) Graph of nterfaces as seen from 
from R22. R31. 


Figure 2.4: Interface-level representations. 


from reaching their target destinations or may route the probes in an unexpected way. 


Ground-truth, which we refer to as the actual network implementation, would prove to be an 


ideal source of information for comparing the accuracy of results obtained from probes. Unfor- 


R11 R21 
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(a) Network map. (b) Graph. 
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Figure 2.5: Router-level representations. 


tunately, an organization’s network layout is a critical piece of information as it usually reveals 
the structure of the main means of communication in the organization. If this information were 
to fall into the wrong hands, an organization would allow itself to be susceptible to sabotage. 
Thus, due to security concerns, ground-truth information is not openly available. 


Nonetheless, despite the aforementioned challenges, the importance of measuring the Internet 
and its “demand” for data to work with has prompted some organizations to curate their results 
as data for use by the research community. For example, as part of its objective to collect 
and share data for scientific analysis of the Internet, CAIDA has a collection of datasets [17] 
resulting from both active and passive measurement of the Internet. We will be using some of 
these datasets in our research. 


2.2.3 Traceroute 
Traceroute [18] is a computer network tool used to show the route taken by packets across an 
IP network. The history of the route is recorded as the Round-trip times (RTTs)? of the packets 


received from each successive host in the route. 


Paris traceroute [19,20] is an improved version of the original traceroute program. The original 
traceroute did not perform well in the presence of routers that employ load balancing on packet 
header fields, since traceroute discovers hops along a route with a series of probe packets, while 
a load-balancing router can direct these probes along different paths. This led to inaccurate 
and incomplete paths, which resulted in erroneous network maps. By controlling packet header 


contents, Paris traceroute obtains a more precise picture of the actual routes that packets follow. 





The length of time it takes for a data packet to be sent plus the length of time it takes for an acknowledgement 
of that data packet to be received. 
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Figure 2.6: Classic versus Paris traceroute adapted from [19]. 
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Figure 2.6 illustrates how the Paris traceroute is preferred over the classic traceroute. Suppose 
we wanted to measure the route from a source to a specified destination. The ground truth is 
shown on the left, where router L balances load across across two paths, via routers A or B. The 


middle diagram in the figure shows a possible outcome when the classic traceroute is used. On 


the right is the output obtained by using the Paris traceroute, which more closely resembles the 
real topology. 


2.3 Existing Measurements 

Network topology ASes will be modeled by graphs in order to facilitate measurements. The 
interface-level maps will be represented by graphs, with vertices denoting the interfaces and 
edges denoting the pair-wise connection between the interfaces. (We detail our approach in 
Section 4.2.) 


There have been several ways to measure graphs. Most of the existing ones use statistical results 
about the Internet. In the following sub-sections, we briefly mention some standard ways of 
measuring networks by representing them as graphs. We will take a different approach in this 


thesis. (A formal introduction to Graph Theory is in Section 3.1.1.) 


2.3.1 Primitives 

These are some of the basic forms of measurements typically used in graph theory which serve 
as the foundation of more elaborate ones that follow. These measures give information about 
the graph but do not compare graphs, as we wish to do for our research. We do not compare our 
measures to these primitives. Nonetheless, these primitives are included here so that the reader 
is aware of innate measurements from graph theory. The terms and definitions used here are 
those found in [21]. 


Distance. If a graph G has a u, v-path, then the distance from u to v, written dg(u,v) or simply 
d(u,v), is the least length (or number of edges) of a u,v-path. lf G has no such path, then 
d(u,v) =. 


Diameter. The diameter of a graph G, denoted diam G, is the maximum distance between any 


two vertices in the graph, i.e., diam G = max, yey(g)d (U,V). 


Eccentricity. The eccentricity of a vertex u, written €(u), is the maximum distance of the 
specified vertex to all vertices in the graph, i.e., €(u) = max,cy(g)d(u,v). Informally, this is the 


largest distance between any two vertices in the graph. 


Radius. The radius of a graph G, written rad G, is the minimum eccentricity of all vertices in 
the graph. i.e., rad G= miNyey(G)E(U). Informally, this is the smallest distance between any 


two vertices in the graph. 
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2.3.2 Statistical Analysis 

Statistical analysis of network structure usually pertains to the selective pairing of identified 
types of vertices, also known as assortative mixing. The converse is referred to as disassortative 
mixing. Maslov et al. proposed in [22] that the structure of the Internet is characterized by 
three identified types of vertices: providers, who run the Internet backbone having high levels 
of connectivity; consumers (i.e., end users) of the Internet service; and the ISPs who join the 
providers and consumers. This particular form of assortative mixing can be seen as selective 
linking according to a scalar vertex property. In this case, the mixing is according to vertex 
degree and is commonly known as degree correlation. Several ways of quantifying degree cor- 
relations have been proposed in previous research. In [23,24], studies of the Internet calculated 
the mean degree @ of the network neighbors of a vertex as a function of the degree k of that 
vertex. A network is deemed to be assortatively mixed if @ increases with k, and disassortative 


if otherwise. The latter was found to be the case for the Internet. 


2.3.3 Centrality 
There are various types of measures for the centrality of a vertex within a graph. These deter- 
mine the relative importance of a vertex within the graph. Some of those that are widely used 


are mentioned here. 


Degree centrality, which is defined as the number of edges incident upon a vertex. Note that 
this is the same as the degree defined above, and some researchers use it as centrality. This 
can be further decomposed to indegree and outdegree which differentiates between the edges 
directed towards or out of the vertex in question respectively. 


Betweenness, when applied to vertex, quantifies the number of times a node acts as a bridge 
along the shortest path between two other vertices. Similarly, when applied to edges, it quanti- 
fies the number of times an edge is traversed along the shortest path between two other vertices. 


Closeness is defined as the inverse of the sum of distances from a vertex to all other vertices 
using the shortest path. The more central a vertex is, the lower is its total distance to all other 
vertices, and thus a higher closeness centrality. 


Eigenvector centrality works by assigning relative scores to all vertices in the network based 
on the concept that connections to high-scoring vertices contribute more to the score of the ver- 
tex in question than equal connections to low-scoring vertices. In other words, it measures the 


influence that a vertex has on a network. This can be computed using Singular Value Decom- 
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position (SVD). 


2.3.4 Graph Edit Distance (GED) 
The Graph Edit Distance (GED) [25] measures the minimum number of graph edit operations 


that are required to transform one graph to the other. These operations include: 


e insert/delete isolated vertex 
e insert/delete an edge 


e change the label of a vertex/edge (if labeled graphs) 


Thus, the more dissimilar two graphs are, the more transformation operations have to be per- 
formed. The time and space complexity involved in computing the GED of two graphs is 
exponentially proportionate to the number of vertices of the two graphs. For large graphs such 
as the Internet, the GED problem is computationally demanding, and in general, considered 
NP-Hard?. 


2.3.5 Spectral Analysis 

Clustering and spatial properties of the ASes of the Internet were characterized by Gkantsidis 
et al. [27] using spectral analysis. The analysis captured significant information about the clus- 
tering properties of the topology. Spectral analysis examines the eigenvectors corresponding to 
the largest eigenvalues of the normalized transposed adjacency matrix of the topology in ques- 
tion. The output is a plot which consists of a specified number of the largest eigenvalues which 
loosely corresponds to the eigenvectors of the main clusters in the topology. 


2.4 Thesis Contribution 


The existing primitive and statistical measures apply mostly to measurements taken within the 
graph. The GED and spectral analysis have been applied between different graphs and also 


across time. However, they give results that might require a further level of interpretation. 


The GED gives a number that quantifies the number of operations required to get from one 
graph to another, and this does not reflect the magnitude of the relative change between the two 
graphs, which is what interests us. From the GED, one was not able to determine how different 


the two graphs in comparison were, unless it was applied to two graphs with exactly the same 





4A problem is NP-hard if an algorithm for solving it can be translated into one for solving any NP-problem. 
NP-hard therefore means “at least as hard as any NP-problem” although it might, in fact, be harder. [26] 
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vertex set [28] which is not feasible in our case. Several samples might have to be taken to 


determine if the change was comparatively small or large. 


The spectral analysis gave a range of values which justly describes the different characteristics 


captured by the analysis. 


Our proposed measures capture the difference between two networks as a single number or 
index. This index intuitively indicates the magnitude of the change on a normalized scale. 


Thus, without making further comparisons, the measure indicates the “size” of the change. 
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CHAPTER 3: 
Behind the Scene 





In this chapter, we bring forth mathematical concepts that might help the reader view the world 
of networks in a mathematical perspective. Here, we also introduce quasi semi metrics which 
shed light when comparing graphs. We will also touch on these measures’ performance and 
scalability when applied to large networks such as the Internet. These measures are chosen 
primarily due to their applicability, as well as, how easy it is to understand their computation 


and the meaning of the result, their scalability and speed for computation. 


3.1 Preliminaries 

Each of our proposed measures gives a simple index (percentage of change). From the indices, 
it is immediately intuited by the reader how much the two graphs have changed in terms of 
vertices or edges. To derive at these indices, we first translate raw collected data into graphs 
using basic graph theory. After which, set operations are performed on either the vertex set or 
the edge set of the graphs. With the massive amount of data, appropriate visualization methods 
are used as they complement our measures. Much of the research would not have made much 


progress without the aid of these visualizations. 


3.1.1 Graph Theory Terminology and Concepts 
A majority of the terms and definitions used in this paper are those found in [21]. Terms and 


symbols not found in that text are referenced appropriately. 


A graph G is a triple consisting of a vertex set V(G), an edge set E(G), and a relation that 
associates with each edge two vertices (not necessarily distinct) called its endpoints. Multiple 
edges are edges having the same pair of endpoints. A simple graph is a graph having no loops 


or multiple edges. A loop is an edge whose endpoints are equal. 


In this paper, we consider only simple graphs, as the study is carried out on logical connections 
of network interfaces. Despite the possibility that two machines are connected physically by 
a common wire or link, multiple interfaces on one machine connect to interfaces on the other 
machine in a uniquely identifiable fashion. Thus, we do not have multiple edges. Likewise, 


loops which connect an interface to itself are not relevant to our study. 
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A simple graph can thus be specified by its vertex set and edge set, treating the edge set as one 
of unordered pairs of vertices and writing e = uv or e = vu, for an edge e with endpoints u and v. 
When u and v are the endpoints of an edge, they are adjacent and are neighbors. This is denoted 
by uv, meaning“u is adjacent to v’”’. Since the data used in our analysis was collected only when 
probes were sent, the information flow is bi-directional, so we restrict our graph model to only 


undirected edges. 


Complete Graph A complete graph is a simple graph of all whose vertices are pairwise adja- 
cent. We denote the (unlabelled) complete graph with n vertices as K,. 


Complete bipartite graph (biclique) A complete bipartite graph or biclique is a simple bi- 
partite graph in which two vertices are adjacent if and only if they are in different partite sets. 
When the sets have sizes r and s, the (unlabeled) biclique is denoted K,,.. 


Path A path is a simple graph whose vertices can be ordered so that two vertices are adjacent if 


and only if they are consecutive in the list. 


Cycle A cyclic graph is one with an equal number of vertices and edges whose vertices can be 
placed around a circle so that two vertices are adjacent if and only if they appear consecutively 


along the circle. 


Erdés Rényi (ER) Random Graph In 1959, Paul Erd6és and Alfréd Rényi introduced the 
model for generating random graphs, including one that sets an edge between each pair of 
vertices with equal probability, independently of the other edges. Motivated by the modeling of 
physical properties and by the analysis of algorithms in computer science and their applicability 
to the Internet, we look at random graphs to model and apply probabilistic techniques for the 
occurrence of events. In our context, such an event could be the existence or creation of edges 


between any two vertices. 


There are two closely related variants of the ER random graph model. 


e G(n,M) model. A graph is chosen uniformly at random from the collection of all graphs 
which have n vertices and M edges. 

e G(n,p) model. A graph is constructed by connecting vertices randomly. Each edge is 
included in the graph with probability p independent form every other edge. Equivalently, 


all graphs with n vertices and M edges have equal probability of p” (1 —p) (3)-M 
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Symmetric Difference Conventionally, if G and H are graphs with vertex set V, then the sym- 
metric difference GAH is the graph with vertex set V whose edges are all those edges appearing 
in exactly one of G and H. The symmetric difference in this case pertains only to the set of 


edges. 


Also, in our context, we consider graphs G and H as having different vertex sets and look 
not only at edges but vertices as well. We then propose two indicators "Vertex Symmetric 


Difference" and "Edge Symmetric Difference", as elaborated in the next section. 


3.1.2 Set Theory 
From [29], a complement of a set A refers to elements not in A. The relative complement of A 
with respect to a set B is the set of elements in B but not in A. The relative complement of A in 
B is formally denoted 

B\A={x€ B|x €A} 


3.2 Our Measures 
In this section we introduce an innovative way of comparing graphs that is intuitive and fast to 


compute. 


3.2.1 Vertex Symmetric Difference 
Here, we consider only vertices, and compare two graphs G and H having two vertex sets V(G) 


and V(H), respectively. 


Definition 3.2.1. For two graphs G and H, we define the vertex symmetric difference vsd(G, H) 


° IV(G)\V(H)| +|V(H) \V(@)| 


vsd\G.F) = "——“W@\t VA) 





Informally, the vertex symmetric difference counts the vertices that are in one graph and not 
the other. This is then normalized over the total number of vertices in both the graphs, without 


double-counting common vertices, so that it is relative to the order of the graph. 


In the case where graphs G and H have exactly the same set of vertices, vsd(G,H) = 0. If 
graphs G and H are disjoint, or have totally different vertices, vsd(G,H) = 1. 


Thus, we see that vsd of any two graphs lies in the closed interval [0, 1]. ie., 0 < vsd < 1. 
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On this scale, we are able to say intuitively if the difference between the two graphs (vertex- 


wise) is significant. 


For example, consider the two graphs in Figure 3.1. 








Figure 3.1: Example to illustrate vsd and esd between two graphs. 


We then have 


ViG)\VAD|+IV(A)\V(G)| _1+1_ 2 


ee a IV(G)|+|V(A)| ~ 6+6. 12 


= 16.7% 





3.2.2 Edge Symmetric Difference 

The measure introduced in this section concerns itself only with edge sets, and we call it edge 
symmetric difference and we present an instance of its use on the topology of the internet to 
compare snapshots of the same network from independent sources. We would like to remind 
the reader that all our graphs have labeled vertices, which allows us to perform symmetric 


difference on the set of edges of graphs. 


Definition 3.2.2. For two graphs G and H, the edge symmetric difference esd(G,H) is defined 


° lE(G) \ E(H)| +|E(H) \ E(G)| 


esd(G,H) = |E(G)| + |E(A)| 





Informally, the edge symmetric difference counts the edges present in one graph and not the 
other. This is then normalized over the total number of edges in both the graphs, without 


double-counting common edges, so that it is relative to the size of the graph. 


In the case where graphs G and H have exactly the same edges, esd(G, H) = 0. If graphs G and 
H are disjoint, or have totally different edges, esd(G,H) = 1. 
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Thus, we see that the esd of any two graphs lies in the closed interval [0,1]. ie., 0 < esd < 1. 
On this scale, we are able to say intuitively if the difference between the two graphs (edge-wise) 


is significant. 
For example, consider the two graphs in Figure 3.1. 


We have 


JE(G)\E(A)|+|E(A)\E(G)| _ 342 _ 5 


SAG = IE(G)|+|E(A)| ~ 847. 15 


= 33.3% 





3.2.3. Basic Properties of esd 

Note that esd(G,H) is not a metric since, for example, it does not satisfy triangle inequal- 
ity as we show below; namely we present an example of three graphs for which esd(G,H) + 
esd(H,K) = ote + aan = 5 — 3 = 3 = esd(G,K). 




















Figure 3.2: Counterexample for the triangle inequality for esd. 


However, esd(G,H) is a pseudo semimetric. Recall the definition of a pseudo semimetric. 











Definition 3.2.3. A pseudo semimetric on a set X is a function d : X xX X —+ Ryo such that, for 





every x,yEX: 

e d(x,x) =0 

e d(x,y) =d(y,x). 
By equal graphs we mean isomorphic graphs. It is easy to observe that both properties are true 
for the edge symmetric difference of graphs. 


Proposition 3.2.4. Let T be the set of all graphs. Then esd(G,H) is a pseudo semimetric onT. 
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Unlike a metric space, points in a pseudo semi-metric space need not be distinguishable; that 
is, one may have d(x,y) = 0 for distinct values x ~ y. And so it is interesting to study equiv- 
alence relations and equivalence classes to see what graphs we can not distinguish with this 


pseudometric, that is what graphs are in the same equivalence class. 


Theorem 3.2.5. Let G and H be two graphs, and define the relation & defined on I, the set of 
all graphs, by 
G&H <> esd(G,H)=0. 


Then & is an equivalence relation, whose equivalence classes are 
Lg ={H:H=GUak,, Va € Zso}, for all graphs G without isolated vertices. 


Proof: Since the reflexive and symmetric properties are easy to verify, we only show the tran- 
sitivity below. Let G,H, and K be three graphs and assume that GZH and H&K. This happens 
if and only if ég_y = é@y_x = 0. Since ég_x = ég_y + éy_x = 0, it follows that G2K. For 
the equivalence classes, G#H if and only if égay = 0, ie., E(G) = E(H), and we obtain 
So ={H:H=GUak;,Va € Zso} for all graphs G without isolated vertices. 


This implies that if we restrict our attention to just the numerator of the esd(G,H) (namely 
considering graphs of the same size) and we do not allow isolated vertices (which will not 


happen in the networks we study) then we have a metric. 


Proposition 3.2.6. Let I, be the set of all graphs with m edges and no isolated vertices. Then 


esd(G,H) is a metric on Tn. 


Proof: Since no isolated vertices are allowed it is easy to see now that esd(G,H) = 0 if and 
only if G,H have the same edge and vertex sets, and this happens if and only if G =H. The 
symmetric property is true because égay = vag. Finally, for the triangle inequality, consider 
this: 














@c—n\? |x|? 
esd(G,H)+esd(H,K) = = + aa 
ll@c—xl|? 
= = esd(G,K). 
2m 
This completes the proof. 7 


eee 


3.2.4 esd for Classes of Graphs 

We initially plotted the esd of complete and cyclic graphs as illustrated in Figure 3.3, and 
observed that they approach limits. This spurred us to obtain the corresponding theoretical 
proofs. This led to the following propositions, which initially considered other standard graphs, 
and is then extended to random graphs. 
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Figure 3.3: Symmetric differences for complete and cyclic graphs. 


Proposition 3.2.7. For the standard classes of graphs below, we find the edge symmetric differ- 


ence. 


. For the complete graph K, (n > 3) we have that esd(Ky+1,Kn) = 

. For the cyclic graph C, (n > 3) we have that esd(Cy+1,Cn) = ee 

. For the path P, (n > 2) we have that esd(Pn+1,Pn) = x. 

. For the complete bipartite graph Kym (n,m > 1) we have that 
esd(Kn+1.m;Knm) = sat 

. For the star graph S, (n> 1) we have that esd(Sn+1,Sn) = 


KR Bw NH MN 


lt = 
2n—1° 


NN 


. For the hypercube graph Q, with 2” vertices, we have that 
esd(Qn+1,On) = 3+ 3Gn7y: 
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Proof: Here is the proof for each of the items above. 


1. Since Kj (i > 3) has (5) edges, it follows that 


(CH) -G) _ Set iont) nd 
d(Kn Kn = ~ Fh ~ ~ 
esd (Kn41,Kn) eae s(nt+1+n—1) Won 


2. C; has i > 3 edges. Suppose the edges of C; are vjv2,v2V3,...,Vi-1Vj,Viv1. To obtain a 








cycle C;, we introduce a vertex v’ together with two new edges v;_;v’ and v’v;, and the 
omission of the edge v;_;v;. It follows that 


221 3 


BCG eee 
esd (Cu+1,Cn) (nt+1)+n 2n+1 


3. P;hasi—1 > 1 edges. Since P; C P,; and P;,; has | additional edge over P;, it follows 


that 
1 1 


d(Pr+1,P,) = ————~ = =. 
esd (Fn+1,Fr) n+(n—1) 2n-1 
4. Kj; has ij edges (i = 1,7 = 1). Without loss of generality, say we introduce a vertex to 
the first partition of K;,; to obtain Kj41,;. Since K;,; C Kj+1,; and Kj+1,; has j additional 
edges over Kj; ;, it follows that 
J 1 


d (Kj. ;,Kj = Qa tae = . 
esd (Ki, Kis1,j) ij+(it+1)j 2i+1 





5. Note that we define S,, here as a tree with 1 internal vertex and n— 1 leaves. Since 


S; C S;41 and $;,; has an one additional edge over Sj, it follows that 


1 1 


esd(Si+1,Si) = Gt) Sk 


Alternatively, S; is the complete bipartite graph K, ;_;. Using part 4. above, we have, 
iene een 
ESA\Di+1,9i) = ES pA i-1) = 7, =: : 
a ie pah oh Ge) = Dea 


6. Q; has i2'~! edges and 2! vertices. Since Q; C Qj; and Q;+ has i2'"! +2! = (i+2)2'"! 
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additional edges over Qj, it follows that 


(i+2)27°' i421 4 


esd(Qi+1,Qi) = G21! 340-3 7 3(3i+2)- 





This proves the result. 7 


We now consider large complex networks, namely the random graph types (Erdds-Rényi graphs [30]) 


as well as scale-free graphs (like the Barabasi-Albert graphs [31]). Recall that a complex net- 


work has the following properties: 


e scale free: many small degree vertices held together by a few large degree vertices 

e small world: small diameter (short paths between any two vertices) 

e evolution: high degree vertices emerge through addition of new vertices and the prefer- 
ential attachment of these vertices 

e competition: vertices with high fitness (ability to attract links) become high degree ver- 
tices 

e robustness: the ability of the system to maintain its function even if many neighborhoods 
do not function. Resilience against random error, but fragility to attacks 


e communities: locally dense neighborhoods that behave similarly. 


Networks with a complex topology and unknown organizing principles often appear random; 
thus random-graph theory is regularly used in the study of complex networks. To obtain the 
random graph G(n,p), we start with n vertices, and every pair of vertices is being connected 
with the same fixed probability p. Consequently the total number of edges is a random variable 
with the expectation value EV|E(G)| = pG): Note that when p is very small, the networks are 
very sparse, and as p increases the network will become more connected. Now we present an 


example that describes the expected value of esd(G(n+ 1, p),G(n, p)) for random graphs. 


It should be noted that esd(G,H) = 1 — p, if G and H are random graphs with the same fixed 
probability p. To see that it is true, recall the following lemma for expected value. We use 
EV (A) to denote the expected value of A since E(G) denotes the cardinality of the edge set of 
G. 


Lemma 3.2.8. Let sets A and B be arbitrarily chosen elements without repetitions from set R. 


Then AI [BI 
EV||ANB|] = —- 
[R| 
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Figure 3.4: Symmetric differences for random graphs (1/000 samples; p=0.25, 0.5, 0.75). 


Thus we have the following: 


Proposition 3.2.9. Let G and H be two random graphs of order n+ 1 and n and same fixed 
probability p, respectively. Then the expected value of esd(G,H) approaches | — p, as n + ©. 


Proof: Using Lemma 3.2.8 where A = E(G) and B = E(H) (and R = E(K,41) since we are 


sampling from all possible edges on n+ 1 vertices, we have the following: 





EV |esd(G,H)| 


EV ee OL EG | 


|E(G)|+ |E(A)| 
|E(G)|+|E(H)| —2|E(A) neo 
|E(G)|+ |E(A)| 
p-("5!) +p- (9) - 22 re 
p-("3')+p-) 
p:n? —2p?- aoe eo), ai 





= ev| 














| 
— 

| 
v 


As n — ©, we have the desired result that EV[esd(G,H)| > 1— p. / 
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Now, the Internet graph may look like it has a random structure at first since nobody directs 
the network flow or adjacency. Thus, of course, it would mean that the degree distribution 
would follow a Gaussian distribution, which does not happen to be the case in the snapshots 
that we can take of smaller networks. Modeling the Internet is still a great open problem with 
some breakthroughs pointing to the possibility of the Internet being a scale free network mainly 
due to preferential attachment that produces large hubs. Thus, research shows that the degree 
distribution follows a power law asymptotically x*, where generally 2 < k < 3 although k can 
follow outside of these bounds. And so, it makes sense to study this model as well. For our 
simulations of power law degree distributions we use the Barabdasi-Albert preferential attach- 
ment model [31]. The Barabasi-Albert-graph (n,m, seed=None), denoted by BA(n,m) returns 
a random graph using the Barabasi-Albert preferential attachment model. That is, it returns a 
graph of n vertices, grown by preferentially attaching new m vertices with a fixed probability p 


to the high degree of preexisting vertices as follows: 


e initially there are ng = n- p vertices, 


e vertex n;+ 1” is introduced with m = p-n; edges to the preexisting n; vertices (i > 0), 

deg v; _ degy; 
Yweviaydegy — 2|E(G*)|’ 
where E(G*) is the set of edges in the graph before the new vertex x is attached. 





e the probability, that vertex v; will be linked to a new vertex x, is 


Graphing the edge symmetric difference for such BA(n,m = np) for 4 <n < 255 gives the graph 
in Figure 3.5. 


The Barabasi-Albert model incorporates two important general concepts: growth and preferen- 
tial attachment, which exist widely in real networks. Theoretically computing the esd for these 
graphs would be desirable. However, we leave this as an open problem since we are not able to 


account for preferential attachment in our theoretical computations of esd. 
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Figure 3.5: Symmetric differences for Barabasi-Albert type graphs (/000 samples; 
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CHAPTER 4: 
Data and Methodology 





In this chapter, the mathematical concepts previously discussed are applied to the real world, 
specifically, the Internet. Here, we analyze datasets collected on the Internet, discuss how 


information pertaining to the analysis is extracted and touch on the facets of analysis. 


4.1 Source of Data 


There are two sources of data that we will be using for our analysis - CAIDA [32] and Naval 


Postgraduate School (NPS); and we will have a source of ground truth to compare against. 


The bulk of the analysis was done with datasets from CAIDA, which has active on-going and 
regularly scheduled data collection since 2008. As such, CAIDA has more historical data than 
NPS has. In the initial analysis of this data, results obtained from CAIDA were counter-intuitive 
to expected results. The analysis showed that the amount of change that was detected by the 
measurement was unusually high. It was in the range of 33% for vsd and 44% for esd, This 
reflects the network changes from one cycle of data collection to the next, as discovered by 
CAIDA’s probes when they traversed from their vantage points to their chosen destinations 


within Egypt. 


A program was then quickly put together by NPS researchers to send probes in a different 
fashion from CAIDA’s, and the results created the NPS dataset. These results were found to 
correlate to the measure’s expected results. This helped to verify the intuitions held for the 
measures and also shed new light on CAIDA’s data collection method. Hence forth, the data 


from this program is for convenience called NPS data. 


4.1.1 CAIDA data 

These datasets contained data collected by a set of Archipelago (Ark) monitors which are glob- 
ally distributed [33]. These monitors are located at vantage points and they act as collection 
stations for the probes that are sent. There are 77 such monitors, and they are located on all 


continents of the Earth, except for Antarctica [34]. 


Ark is CAIDA’s current active measurement infrastructure. Its development was motivated by 


the need to reduce the effort required to develop and deploy sophisticated large-scale mea- 
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surements and to go towards a community-oriented measurement infrastructure, which allows 
collaborators to run their measurements on security-hardened distributed platforms. Of the sev- 
eral measurement datasets obtained, of interest to our analysis was the Internet Protocol version 
4 (IPv4) Routed /24 Topology Dataset [8] explained below. 


Ark allows a coordinated team probing effort to obtain large-scale traceroute-based topology 
measurements. Currently, monitors are grouped into three teams and the measurement work 
is dynamically divided up among team members. With this parallelized process, a traceroute 
measurement to all routed /24’s > is obtained in about two to three days for a team of 17-18 
monitors probing nine and a half million /24’s. We call this period of two to three days of data 


collection a cycle. 


Data is collected by sending probes continuously from randomly selected vantage points in these 
monitors to destination IP addresses. From each IPv4 /24 prefix on the Internet, a destination 
is randomly selected, such that a random address in each prefix is probed approximately in 
every cycle. A complete cycle of probed data collection would take two to three days to be 
completed. For each cycle, this uncompressed data averaged ~ 3.5 GB in size and can be 
parsed by appropriate tools such as the sc_analysis_dump tool © included in the scamper [35] 
distribution package. 


4.1.2 NPS Data 
This is data collected by a Python program that was put together to obtain data on specific 


network segments in a deterministic fashion. The program sends out hourly probes from fixed 
vantage points to selected destinations. The initial motivation to collect this data was to ver- 
ify certain intuitions that we had on why the esd and vsd measures did not give readings as 
expected when applied on CAIDA’s data. Although the Ark infrastructure is also utilized, its 
methodology differed from CAIDA’s in that it probes a specified destination network segment 
from fixed vantage points. Recall that CAIDA probes randomly selected /24 addresses and 
from random vantage points as well. With the randomness caused by the source and destination 
points removed, the traceroutes between cycles of probing were expected to be more similar. 


This was verified when the esd and vsd measures applied to this dataset gave a low reading 





> An IPv4 address is a 32-bit integer value. /24 is the prefix of the IPv4 network starting at a given address, 
having 24 bits allocated for the network prefix. 

© This utility provides a dump of traceroute data in a textual format that is easily parsed by scripts.Each line in 
the output contains a summary of a single trace. This includes the interfaces visited and the delay of each response. 
The details of the output format can be seen in Appendix 6.3. 
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which averaged ~0.6% per cycle, as compared to the high reading of ~40% from CAIDA. 


4.1.3 Purdue University’s data 

The network topology of Purdue University is obtained from network configuration files that 
were used to configure the network devices within the campus. Router devices were first iden- 
tified by looking for configuration commands associated with the router configuration. The 
portions of data that related to router configurations were then extracted. Interfaces of these 
routers were then consolidated and sorted by subnets. Assuming that interfaces on the same 
subnet were connected, the inter-router connectivity was derived. Thus, Purdue’s internal net- 


work topology at the router level was obtained. 


4.2 Data Selection and Preparation 

We describe the data collected by CAIDA to give the reader a feel for the amount of processing 
involved. A parse of the entire dataset for one cycle collected by a single team gave over 
9,000,000 traces, from which we obtained approximately 900,000 vertices and 1,600,000 edges. 
Vertices, in our context, would represent interfaces (of routers), and edges would be connections 
between interfaces that were discovered by the traces. Due to the immense size of the Internet, 
to see a one percent change in our measurement readings, we would need a change in © 18,000 
vertices or © 32,000 edges to occur. With a dearth of events that could bring about such an 
impact at that level, only specific areas of the Internet were picked for a closer look, so that the 


readings would be more significant. 


A study [4] was previously done on the disruption of Internet communications in several North 
African countries. This disruption was in response to civilian protests and threats of civil war 
which occurred in the first months of 2011. With this as a lead, we chose Egypt and Libya as 
areas of interest to determine if our measurements could detect these events. Purdue University 


was also chosen for a comparison as ground truth. Details of these are in Chapter 5. 


4.2.1 Preparation 

The output from sc_analysis_dump tool® is a list of traceroute-like data. An entry in this list 
closely resembles the output from a traceroute, where measurements of transit delays of packets 
across an IP network are taken together with the history of the route recorded as RTT of the 
packets received from each successive router along the route. As we are concerned with how 


the network is mapped, rather than its performance, we concern ourselves only with the router 
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interfaces and disregard the RTT measurements. From one such entry, we have a sequence of 


interfaces in a fixed order. 


Representing the interfaces as vertices, we “string” up the interfaces in order. Thus, we have an 
edge from the first interface in the sequence to the second one, another edge from the second 
interface to the third one, and so on. So, from each entry in the list of traceroutes, we obtain 
a path. Doing likewise for the entire list of entries, we obtain several paths, and by forming a 
union of all these paths (discarding edge and vertex multiplicities), we build a graph. This graph 
is a snapshot representative of the network map discovered by the probes in that cycle. The vsd 


and esd measures allow us to compare two such graphs. 


4.2.2 Challenges 

Despite improvements from the use of the Paris traceroute as mentioned in Section 2.2.3, there 
exist routers that do not respond to the probes. Some of these routers drop the probing packets 
completely, while others choose to pass on the probing packets on to the next router (en-route 
to the intended destination) but still do not respond to the probes themselves. We refer to the 


former as probe-dropping routers and the latter as non-responding routers. 


The behavior of these routers results in several cases where we obtain incomplete traces. In 
the first case, where a router drops the probing packets completely, we obtain the traceroute 
that ends at the router that is upstream of this router. A second case, where we have non- 
responding immediate routers, results in a trace that is fragmented by one or more intermediate 
router’s non-response. Here, information of the network that is downstream of a non-responding 
intermediate router is captured since the probe can continue toward its intended destination. An 
outcome that is a combination of these two cases is also possible. So, we can have a traceroute 
which is fragmented by non-responding intermediate routers, and which terminates at a probe- 
dropping router. Of course, there are also cases where the probes traverse non-responding 


intermediate routers and arrive at their specified destination. 


Traces arriving at their destination and not encountering any non-responding intermediate routers 
en-route are referred to as complete; while traces that do not reach their intended destination 


and/or traverse non-responding immediate routers en-route are referred to as incomplete. 


In handling such behaviors, only deterministic interface connections recorded by the traceroute 
record were taken for analysis. Deterministic interface connections would refer to those con- 


nections that have interfaces from which a router has responded. A fragmented traceroute might 
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be seen as comprising more than one separate path, so the resultant graph constructed from the 


union of such paths could thus be disconnected. 
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Figure 4.1: Comparison of traceroutes with same source and destination addresses’. 


An example of different router behavior is illustrated in Figure 4.1. Here, we have three tracer- 
outes obtained with the same source and destination interfaces but at different times. The out- 
puts are different, and this difference shows the unreliability of traceroute probes. Also, when 
probing poorly connected regions of the Internet, such as Libya and Egypt in our case, packet 
loss is also a possibility. It is interesting to note that some routers respond at certain times and 
not at other times, as observed in the second hop of Figure 4.1. By comparing similar tracer- 
outes, further information could be gleaned. By matching information from these traceroutes, 
and taking into account similarities in intermediate routers’ interfaces as well as their RTT, the 
interfaces of non-response routers could be inferred with some level of certainty. In our exam- 
ple, the hops taken by the traces are largely similar. The non-response of the second hop in 


Figure 4.1 could probably be inferred from the other two traceroutes. 


4.3 Analysis 


In the initial analysis of CAIDA’s data, it was observed that the comparison of graphs in time 
(which represent snapshots of the Internet in general) gave a high reading on the esd and vsd, 





These data are extracted from fields 14 and onwards of recorded traceroutes. The value ’q’ denotes no response 
at that particular hop. Refer to Appendix 6.3 for details. 
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meaning very large changes in the topology. Investigations revealed that part of this was at- 
tributed to the way in which the data was collected. 


Recall that CAIDA randomly selects a destination from each routed IPv4 /24 prefix for prob- 
ing and that probes are then sent to these selected destinations from randomly selected vantage 
points. The effects of probing a selected /24 destination subnet from randomly selected van- 
tage points are described in Section 5.2 where we use NPS data for comparison with that of 
CAIDA’s. We shall now investigate the effect of probing randomly selected destinations in a 
/24 destination subnet. Here, two probes sent from the same vantage point to the same /24 
destination subnet could give different results despite taking the exact same path to the desti- 
nation subnet. The reason for this was that the final destination interface would probably be 
different due to the random selection process—an artifact of CAIDA’s probing methodology, 
since CAIDA’s goal is to discover as much of the real topology as possible. As a result, the 
difference between the graphs is artificially increased, and the measure reflects as such with a 
higher reading. To mitigate this effect, all of our analysis is conducted with the exclusion of this 


last destination interface. 
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Figure 4.2: Effect of randomly selected last hop on reading differences. 


Figure 4.2 is a plot of inter-cycle vsd differences of Egypt and Libya ASes for the years 2010 to 
2012 using CAIDA datasets. The vsd difference plotted in Figure 4.2 is given by the difference 
between inter-cycle vsd readings which includes the randomly selected last destination interface 
and the difference between inter-cycle vsd readings which excludes the randomly selected last 
destination interface. 1.e., vsd difference= vSdinclude random destination — VSdexclude random destination: 
We do likewise for the esd. It is observed that both the vsd difference and esd difference are 
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positive throughout the entire period, indicating that the inclusion of the randomly selected final 
destination does indeed result in an increase in the measurement readings. The distinct drop in 


measurements that occurred in February 2011 will be discussed further in Section 5.1. 


4.3.1 Spatial 

To observe how different segments of a network behave, we can dissect the network with dif- 
ferent levels of granularity. The level of granularity could depend on the size of the desired 
network segment(s). For our case where the segregation is done at a country level, samples of 
ASes representative of a particular country are first obtained. To do so, the registered organi- 
zational names of ASes matching the chosen country are obtained from the Regional Internet 
Registry (RIR). ® We thus have the corresponding ASNs for these ASes. Prefixes, as announced 
by these ASNs, are obtained by observing the global Border Gateway Protocol (BGP) ° routing 
table as provided by [36]. Only the /24 prefixes within the larger aggregates announced by the 
ASNs are considered in the analysis. 


To go further in depth into the analysis, the topology external and internal to an identified AS 
can also be studied. With the ASes of a country identified, analyzing these would be analogous 
to looking at the topology external and internal to the country. 


In technical terms, this is achieved by bisecting the traceroute data. The first part consists 
of traces from the source to the AS, and the second, the point of entry from the AS to the 
destination. With this information, the measure is effectively “broken down,” such that the 


contribution made by the two parts, internal or external, can be easily identified. 


Henceforth in our research, references to internal and external networks are from the perspective 


of the ASes, and combined network refer to the convolution of both these networks. 


Here are some perspectives that an analysis can offer. Depending on the desired analysis, a 


combination of perspectives is also possible. 


4.3.2 Temporal 
To observe how a network behaves in time, snapshots of the network can be taken at time 


intervals. In fact, an example is previously seen in Figure 4.2 where we discussed the effects 





8An organization that manages the allocation and registration of Internet number resources within a particular 
region of the world. Internet number resources include IP addresses and ASNs. 

°The protocol which is used to make core routing decisions on the Internet; it involves a table of IP networks 
or “prefixes” which designate network reachability among ASes 
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of including randomly selected destinations in our analysis. We consider readings in the time 
intervals to to ft), t) to t2, and so on until the last time interval t,_ to t,, where t¢,, is the desired 
end time. In so doing, we obtain readings in the desired time range fg to f, with a step size of 
one unit. In this case, a unit is the time interval from one cycle of CAIDA’s probing to the next 


cycle. 


Building on this, a snapshot from one point in time ¢; could be compared to another in any other 
point in time f; in a non-sequential fashion. In other words, the network at time fo is compared 
to the network at time ¢,,f2 and so on until ¢,; the network at time f; is compared to the network 
at time f2,f3 and so on until ¢,,; and doing likewise, all other values of f; are iteratively compared 
to t;. With this, a graph can be constructed to facilitate the study of possible relationships of the 
measure from one point in time to any other time. Note that since the measure is symmetric, 
the entire space of size n? need not be computed, rather just 7°/2. We will briefly illustrate the 


use of non-sequential time comparison in Chapter 5 of our analysis. 


Comparison using windows (i1.e., time frames) of varying sizes can also be done. To perform 
a comparison for a window of a specified size, all cycles within that window are overlapped, 
giving a bigger and denser snapshot. Recall that each cycle is represented as a graph, so the 
graph representation of the window is the union of the cycle graphs within that window, again, 
discarding multiplicities. Comparison between windows is then done by comparing this union- 
of-graphs with the next union-of-graphs and so on. In the analysis, windows of varying du- 
rations - daily, monthly, two-monthly (i.e., every two months), and three-monthly (1.e., every 
three months, or quarterly) were used. Since the duration of one cycle of CAIDA’s probing 
is approximately two and a half days, these inter-cycle plots are loosely referred to as “daily” 
plots. A monthly plot would be the accumulation of data over several cycles with a calendar 
month cut-off. To obtain a graph representative of one month, the “daily” cycles for that month 
would be accumulated by taking the union of these “daily” cycle graphs. The process described 
is repeated to obtain the data for each subsequent month, after which the graph of one month 
is compared to the one of the next month, and so on. Likewise, a two-month and three-month 
reading is the accumulation of cycles within cut-offs of two and three calendar months respec- 


tively. 


4.3.3 Frequency and Probability Distribution 
The frequencies of different class intervals can be tabulated and presented in the form of a 


histogram. The class intervals in this case would be ranged values of the measures’ indices. 
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By normalizing these frequencies, we can obtain the approximate probability distribution of the 
measures’ indices. In this, the probability for each class interval is easily observed, with the 


total area equalling one. 


The histogram helps us to quickly identify the shape, mode and spread of the readings quickly. 
The shape would indicate how the classes are distributed; the mode would be the most frequently- 
occurring class; and the spread shows how different the values are from each other and from the 


mode. 


We will use this in Chapter 5 of our analysis. 


37 


THIS PAGE INTENTIONALLY LEFT BLANK 


38 





CHAPTER 5: 
Case Studies and Results 





In this chapter, we detail the results obtained by applying the esd and vsd measures in various 
ways. In all cases, the measures are applied to graphs that represent inferred segments of iden- 
tified Internet topology. The chosen segments roughly represent countries or institutions. These 
specific segments were chosen due to interesting live events that were likely to exhibit signifi- 
cant topological changes and temporal deviations. Taking the lead from [4], the Egyptian and 
Libyan ASes were chosen as representatives of their respective countries’ network-structuring 


bodies. 


5.1 Egypt/Libya Network from CAIDA Data 


We focus on a specified area of the Internet—the Egyptian and Libyan network. Some re- 
sults from the application of esd and vsd measurements over a three-year duration probing 


Egypt/Libya are presented and discussed. 


Three-years’ worth of CAIDA datasets, as detailed in Table 5.1, were analyzed. The identified 
Egyptian and Libyan ASes that were chosen for this case study are as shown in Table 5.2. 
Traces with destination addresses that fell within the prefixes of these identified ASes were then 
extracted and analyzed. 


Traces processed Distinct edges Distinct vertices 
discovered discovered 








Containing identified 595,783 
Egypt/Libya Prefixes 





27,521 8,700 
Table 5.1: Statistics of CAIDA dataset used for three-year study of Egypt and Libya. 


The ASes are identified, and the prefixes they originate in the global BGP table as inferred by 


Routeviews [36] are shown in Table 5.2. 


By focusing on these ASes, an inferred topology of the networks encompassing Egypt and 
Libya was constructed. In our context, the address prefixes of these ASes are representative of 
the countries mentioned. Note that other ASes and prefixes reside in Egypt and Libya. Our goal 
was not to be exhaustive, but rather to focus on several large networks known to be in these 


countries that would serve as representative destinations to probe. 


39 





ASN 13388, EGYPTIAN-TELEPHONE - | 65.214.64.0/21 

Egyptian Telephone: 208.103.192.0/19 
216.138.48.0/20 
216.138.56.0/21 
216.158.112.0/20 








ASN 25364, EgyptCyberCenter-AS: 81.29.96.0/20 
ASN 21003, GPTC-AS Libya Telecom 41.208.64.0/18 
62.240.32.0/19 











41.252.0.0/14 
Table 5.2: ASes of Egypt and Libya that were used in our case study. 





5.1.1. Temporal 


Readings for “daily” plots (i.e., esd and vsd measures that compared cycle-to-cycle data) were 
observed to fluctuate substantially. To investigate the effects of using different window size 
for comparison, monthly, two-month and three-month (i.e., quarterly) plots were constructed as 


seen in Figure 5.3. 


Recall from Section 4.3.2 that a window size is increased by taking the union of the aggregated 
cycles within that window. It was observed that when the window sizes were increased in 
duration, the readings also increased. This happened for both esd and vsd readings. With a 
window size of a longer duration, more vertices and edges are accumulated within one window 
for comparison. We note here that when making comparisons using CAIDA data, the union of 
CAIDA cycles yields a better picture due to CAIDA’s random monitor selection when collecting 
data. The effect of CAIDA’s random monitor selection is discussed further in Section 5.2. So, 
if similar interfaces and connections come and go, their accumulation over time would give a 
zero net effect on esd and vsd measures. Likewise, accumulation of dissimilar interfaces and 
connections would result in an increased net effect on esd and vsd measures. The results seem 


to suggest the latter case. 


It could also be the case where the window size chosen was yet to be large enough to obtain the 
zero net effect. Notice that as the window size used for comparison increases, the readings for 
both esd and vsd increase (as seen in Figure 5.3), but the magnitude of fluctuations decreases 
(as seen in Figure 5.2), which is intuitive. Here, fluctuationesq = max(esd) — min(esd) and 
fluctuationysq = max(vsd) — min(vsd). We varied the window size used for comparison, and 
plotted the magnitude of the fluctuations in readings per average cycles per given window size, 


against the different window sizes. The average cycles per given window are as shown in 


40 





0.8 


0.7 


AN. ee 


monthly.all 


eos aA pS eee ~ 
—— 2monthly.all 


eh Atta | 
al Aa alba Mit ry 


0.2 


—— 3monthly.all 














0 


T T T T 
2010-01-01 2010-07-01 2011-01-01 2011-07-01 2012-01-01 2012-07-01 








(a) esd. 











0.5 
| —— daily.all 
vsd 0.4 I —— monthly.all 
— —— 2monthly.all 
> —— 3monthly.all 
0.3 


YE 
2S a wll i i a Oy Att 


Wil Vfl W Vl) VAY mV 


0.1 





0 T T T T T 
2010-01-01 2010-07-01 2011-01-01 2011-07-01 2012-01-01 2012-07-01 








(b) vsd. 


Figure 5.1: Readings for network (i.e., internal and external combined) of Egyptian and Libyan 
ASes. 
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Table 5.3. Thus, the magnitude in fluctuations is normalized, and this was done in order to 


capture the average fluctuation per cycle across different window sizes. Due to its exponential 


nature, the results are illustrated in Figure 5.2 using a logarithmic graph for the vertical axis. 


The results show that the decrease in fluctuation was most significant when the window size was 


increased from “daily” to monthly, where the readings drop by almost two orders of magnitude. 





Window size Average cycle(s) 
(duration) per given window size 
“daily” 1.0 
monthly 13.9 
2-monthly 27.8 
3-monthly 41.8 
6-monthly 83.5 
9-monthly 125.3 
yearly 167.0 





Table 5.3: Average cycle(s) per given window size. 
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Figure 5.2: Effect of comparison window size(duration) on fluctuations in readings. 


The results from Figure 5.2 seem to support the hypothesis that a near-zero net effect could be 


achieved by choosing a large enough window size. 
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5.1.2 Spatial 


Here we separate the networks into those internal and external to the ASes. 


In Figure 5.1a, we observed fluctuations in the plot for the combined network (1.e., internal 
and external) without distinct anomalies. However, in Figure 5.3 where we look at the plots of 
internal and external networks separately, pronounced changes were observed for the internal 
network, as compared to that for the external network. That is because even big changes in a 
few ASes, such as those in Egypt and Libya, will not have a big effect on the bigger picture of 


the Internet. 


In fact, the plot representing the external network closely resembles the one from the combined 
network. This observation was verified in that there were significantly more external edges as 
compared to internal ones, and is shown in Figure 5.4. One can think of the external network 


changes as noise that masks the changes happening internally. 


An interesting observation was made in Figure 5.3a. When there were changes in the network, 
such as during the Arab Spring, the esd readings were higher when a larger window size was 
used for comparison; and when the network was relatively stable, the converse took place in- 
stead. i.e., esd readings were lower when a larger window size was used. This shows that the 
change was not merely due to the drop in the number of vertices, but rather, in the instability of 
the network where different vertices came in and out of the network. This instability is exactly 


what we were hoping that our measure would depict. 


Delving deeper, we looked at the absolute rate of change of the index. This was obtained by 
plotting the absolute difference between two sequential readings. In other words, it gives us the 
magnitude of the time derivative of the index. Illustrated in Figure 5.5 is this rate of change 
with different granularities—daily, monthly, two-month and three-month basis. 


The rate-of-change plot helps to identify times of interest by accentuating the period of time in 


which changes in esd occur. 


Next, we take a brief look at network comparisons in non-sequential time. Thus far, we have 
been looking at comparisons in sequential time, meaning that we compared the snapshot of 
a network to another snapshot in a chronological manner. We now explore network compar- 
isons between one cycle (in time) to every other cycle (in time), where each cycle represents 


a time period of about three days. This comparison is done in order to consider the case of 
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Figure 5.3: Readings of esd for different network segments (i.e., internal and external) of Egypt 
and Libya. 
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Figure 5.4: Edge counts for Egypt and Libya AS from 2010 to 2012. 


constant changes in the network or merely consecutive changes in time. Considering the data 
in Figure 5.3, we notice the oscillating effect in the graph of the daily reading, so it is interest- 
ing to see if the network just bounces between two states or if it really is that living organism 
that keeps changing. We do so for the prior time intervals mentioned, and obtain Figure 5.6. 
The horizontal axes have been indexed such that they represent the chosen unit (i.e., “daily”, 
monthly, two-month or three-month) in a chronological fashion. Taking the two-month plot for 
example, the first two-month period (i.e., January 2010 to February 2010) is indexed as 0, the 
second two-month period (i.e., March 2010 to April 2010) is indexed as 1, and so on. 


We observe that there is a distinct “ridge” that corresponds to the period of February to March 
2011. The “ridge” represents a collection of points that have higher than normal esd readings 
as compared to all other readings in the three-year study. This depicts the period of time in 
which significant changes in the network were occurring within our area of interest, which we 
correlate with the Egyptian revolution. Also, notice that indeed there are changes of about 40% 
across the whole interval, when comparing any day to any other day. This confirms that there 
are constant changes in the Internet, and it is not the case that the Internet bounces back and 


forth between certain states. 
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Figure 5.5: Rate of change of esd for internal network of Egyptian and Libyan ASes. 


5.1.3. Frequency Distribution 

In Figure 5.7, we observe how the esd readings are distributed. A distinct mode is observed at 
about 0.25 throughout the three years. The shape of the histogram for the years of 2010 and 
2012 bear close resemblance to each other, while that of 2011 is remarkably different. This 


irregularity arouses the urge for further investigation. 


The readings are separated into internal and external networks in order to investigate which one 
of the two (i.e., internal or external) networks has greater influence on the combined network 
readings. The “breakdown” histogram in Figure 5.8 shows that the histogram of the external 


network resembles the combined one in Figure 5.7. 


As seen earlier in Figure 5.4, due to the significant larger proportion of edges that were con- 
tributed by external networks, the esd readings for external networks were found to closely 
resemble that of the combined network. At the same time, the readings for the internal network 
were more susceptible to changes inside the Egyptian and Libyan ASes of course, as these took 


place in a comparatively smaller network. 
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Egypt Libya Cycles ("daily") Egypt Libya Cycles (monthly) 





(a) “Daily”. (b) Monthly. 


Egypt Libya Cycles (2-monthly) Egypt Libya Cycles (3-monthly) 


ESD 
ESD 





(c) 2-Month. (d) 3-Month. 
Figure 5.6: Non-sequential esd readings for internal network of Egypt and Libya. 


By decomposing Figure 5.7 spatially into internal and external networks as illustrated in Fig- 
ure 5.8, we get a clearer view of the fluctuations in the readings for the internal network. Once 
again, the “noise” from the external network masks out changes in the internal network. The 
changes in the latter provide a good indication to social and political events that were happening 
to Egypt and Libya, and the figures above affirm that the events occur around the first months 


of 2011, which coincides with the Arab Spring timeline. 


5.2 Egypt Network from NPS Data 


In this section, we investigate the effects that randomly selected vantage points have on our 
measurements, which motivate the way NPS collects data. The prefixes used in this context are 
as shown in Table 5.4. 
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Frequency 














ar 0.2 0.3 0.4 0.5 0.6 0.7 
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(a) 2010. (b) 2011. 





egyptLibya.2012daily.all (b=10) 





Frequency 
N 
8 











820 0.22 0.24 0.26 0.28 0.30 0.32 0.34 = =60.36 0.38 
esd 


(c) 2012. 


Figure 5.7: Frequency distribution of “daily” esd readings for three years (combined network). 





ASN 13388, EGYPTIAN-TELEPHONE - | 65.214.64.0/21 

Egyptian Telephone: 208.103.192.0/19 
216.138.48.0/20 
216.138.56.0/21 
216.158.112.0/20 


Table 5.4: AS of Egypt that was used by NPS. 














The data was collected both by CAIDA and NPS over the same period of four weeks for the 


same fixed AS. Statistics of the data are as shown in Table 5.5 


Using the Ark infrastructure, probes were sent from pre-determined vantage points to specified 
destinations in the chosen AS above. This method of data collection differed from that of 


CAIDA’s in that the vantage points and the destination address were not randomly chosen. 
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(c) 2012. 


Figure 5.8: Frequency distribution of “daily” esd readings for three years (internal and external 
network breakdown). 











Number of traces Distinct edges Distinct vertices 

| processed discovered discovered 

NPS 18,173 Si 410 
CAIDA | 1,455 | 2,212 1,295 


Table 5.5: Statistics of dataset used by NPS for four-week study of Egypt. 


Recall from Section 4.1.2 that probes were sent out hourly for NPS’s collection method. To 
have a better comparison with similar data from CAIDA’s “daily” window size, NPS’s hourly 
data was first aggregated to a daily window size. Comparing NPS data versus CAIDA, we 
obtain the readings of esd and vsd of Figure 5.9. 


From Figure 5.9, we observe that the esd and vsd measurements from CAIDA are constantly 
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Figure 5.9: Effects of randomly chosen vantage points on measurements. 


higher than that obtained from NPS data. For sure, some of the increase in measurement read- 
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ings is attributed to the random selection of vantage points. The magnitude of this increase 
is depicted by the “Difference” plots, which takes the magnitude of the difference between 
CAIDA plots and NPS plots. For esd and vsd, these differences average approximately 48% 


and 37% respectively, which is very substantial. 


We have seen in Section 5.1.1 that the fluctuations in readings can be decreased by the use of 
larger window sizes used for comparison. It remains to be seen if we can reduce this artificial 
increase in measurements, which is attributed to using randomly selected vantage points, by 
using larger window sizes. However, we do not have sufficient NPS data over a longer time 


period to verify this. 


5.3. Purdue University - CAIDA versus Ground Truth 


For this section, we compare the topology derived from CAIDA’s datasets with the topology 
that is derived from Purdue University’s network configuration files. Henceforth, we refer to 
the latter as ground truth. The comparison is done at a different granularity level, but in the 
same way as before, since Purdue’s data gives us the router-level topology instead of the usual 


interface level ones with which we have been dealing. 


By sieving out the router configurations as described in Section 4.1.3, an interface-to-router 
hostname lookup table was constructed. This lookup table would allow us to obtain a unique 
identifier of the router (such as its hostname) that was associated with a given interface. 


From CAIDA’s data, we extracted only the interfaces that were internal to Purdue’s prefixes as 
listed in Table 5.6. Then, using the aforementioned lookup table, these interfaces were translated 


to their corresponding router hostnames to obtain a router-level topology. 





Purdue University 128.210.0.0/16 
128.211.0.0/16 
128.10.0.0/16 


Table 5.6: Prefixes used by Purdue University. 














Corresponding graphs were constructed from these two sets of data (i.e., topology obtained from 
Purdue’s configuration files and Purdue’s internal network derived from CAIDA’s datasets), and 
the esd and vsd measurements were applied. Here, again, the traces from CAIDA were trimmed 


and only the internal data to Purdue were used. 
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As Purdue’s configuration file was collected on January 6, 2011, the corresponding month’s 
dataset from CAIDA was taken for comparison. 


From Purdue’s configuration files, 178 vertices and 2975 edges were obtained. While from 
CAIDA, 160 vertices and 78 edges were obtained. Note that vertices represent routers and 
edges represent pair-wise connections between routers. Applying our measurements between 
these two sets of data, we obtained vsd and esd readings of 0.68 and 0.98 respectively. From the 
measurements, we infer that the internal network of Purdue University, as discovered by CAIDA 
probes, is quite different from that which is implemented on campus, particularly at the edge 
level. However, it was expected that we would not discover most of the physical connections. 


We elaborate on our findings in the following paragraphs. 


Denoting the graph representing ground truth as G and the derived topology graph from CAIDA 
as H, the comparison of these two graphs is visualized and is shown in Figure 5.10. The 


visualization has been color-coded as follows: 


e Red: These are vertices common in both G and H, and constitute 26.47% of all vertices. 

e Green: These are vertices present only in H and not in G, and constitute 12.75% of all 
vertices. 

e Blue: These are vertices present only in G and not in H, and constitute 60.78% of all 


vertices. 


The green vertices observed in Figure 5.10 are those with IP addresses, which have not been 
resolved from our lookup table. These probably represent addresses assigned to Purdue but 
used as the remote address for equipment not managed by Purdue, e.g., a point-to-poink link 
with a service provider. As such, they are not captured by the configuration file obtained. 
The blue vertices represent the routers from the configuration file that are not discovered by 
CAIDA. A fair amount of such vertices are expected, since the configuration file provides a 


more comprehensive source of information on Purdue’s internal network topology. 


Next, we investigate the “ideal” number of aggregated cycles from CAIDA that is required 
to obtain a topology that is most similar to that which is depicted by Purdue’s configuration 
file. By closest, we mean the comparison of two graphs which give the lowest vsd and/or esd. 
From operator feedback, we know that there has not been any significant change in the network 
configuration for Purdue. We begin by comparing the derived router-level ground truth topology 


with increasing window size used for comparison. In other words, we compare the ground truth 
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Figure 5.10: Visualization of comparison of two graphs - topology obtained from Purdue’s 


configuration files and that obtained from CAIDA. 


with one cycle, then ground truth with the union of two cycles aggregated, ground truth with 
the union of three cycles, and so on. Starting from the first cycle in January 2011 and with a 
window size of one cycle, we do this comparison iteratively, with window size increasing by 
one cycle in each iteration. This comparison is done for the entire year of 2011, and the plots 


obtained for the vsd and esd are as shown in Figure 5.11. 


We observe that vsd had the lowest reading when eight cycles were aggregated, after which the 
readings went up. We postulate that the optimal number of routers within Purdue’s network 


were discovered after eight cycles, after which increasingly many routers outside of the univer- 
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Figure 5.11: Comparison of aggregated cycles from CAIDA with topology obtained from Pur- 
due’s configuration files. 


sity’s jurisdiction were discovered. Meanwhile esd was observed to be decreasing slowly as 
more cycles were aggregated. The number of distinct connections between discovered routers 
(i.e., edges) was also found to increase at a much slower rate than the number of routers (i.e., 
vertices) themselves. This is counter-intuitive since routers usually have connections on more 
than one interface, and we would expect the number of edges to increase at a faster rate than the 


number of vertices. 
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CHAPTER 6: 


Future Work and Conclusion 





In this chapter, we present our findings and provide insights into areas that might require further 


research. 


6.1 Summary 
Our goal was to introduce a new and intuitive measurement for comparing graphs that is com- 


putationally fast and scaleable to large graphs. 


The esd and vsd measurements were proposed, and they capture changes in edges and ver- 
tices respectively. Being normalized, the measurement intuitively reveals whether differences 
between two graphs were low or high: with zero being lowest (i.e., the two graphs were ex- 
actly the same) and one being highest (i.e., the two graphs were completely different). These 
measures give better insight into why the networks are different than merely looking at counts 
of vertices and edges. That is, it captures if the networks bounce back and forth between cer- 
tain stages or constantly evolve, and we found that the latter was true when analyzing specific 
snapshots of the Internet router interface graph. 


By transforming the network-in-question into a graph, and then performing set operations for its 
set of vertices and set of edges, the measurements can scale up to accommodate the comparison 
of very large networks. Moreover, involving only simple set operations, the measurement is 


computationally fast. 


We tested these measures on data captured from traceroutes to a combination of a few fixed 
ASes, and discovered, as intuitively it makes sense, that the esd for changes within ASes did 
not correlate with esd outside of these ASes (i.e., the rest of the Internet). We then applied these 
measures just to the data internal to the union of these ASes that constitute Egypt and Libya, 
and we noted the correlation between political events happening in that area, and the change in 


the internal network according to our measures. 


6.2 Future Work 


We have barely scraped the surface for using these measurements, and there remains much to 


be explored. 


55 


The following are some areas for possible future research. 


(1) Perform similar measurements on different datasets. 


(2 


ww 


The bulk of the dataset used in our research uses CAIDA datasets. It would be insightful 
to consider using other datasets. In so doing, the difference in the datasets could point out 
where one excels over the other. Their individual strengths might then complement each 


other to build a more accurate topology for the Internet. 


It is also noted that this research utilizes the CAIDA dataset from only one team (i.e., 
team1). Due to time constraints, we were not able to study the data obtained by other 
teams. Comparing all team data, or even aggregating all data might serve to provide a 


richer data source. 


Possible correlation between the window size used for comparison and the readings that 


are taken from the measurements. 


In Section 5.1.1, specifically Figure 5.2, it was observed that the fluctuations of the read- 
ings were smaller when the window size used for comparison were larger. Ideally, it 
might be best to aggregate as large a window size as possible to build a “good” picture 
of the Internet. However, due to the Internet’s dynamism and also the nature of the study 
involved, it might not be practical to do so. Thus, the goal is to find the sweet-spot and 
strike the balance between having a good enough picture with the smallest possible win- 


dow size. 


In Section 5.1.2, we saw that when there were changes in the network, the esd readings 
were higher when a larger window size was used for comparison; and when the network 
was relatively stable (i.e., due to the lack of change or growth), the esd readings were 
lower when a larger window size was used. Further research could probably be con- 
ducted to study this relationship, which might speed up the process in identifying interest 


areas. 


(3) Non-sequential measurements between any two points in time. 
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In our research, we have looked mostly at comparing readings taken sequentially in time, 
briefly touching on non-sequential time analysis as seen in Figure 5.6. Further research 
could possibly look into “globally” expected network changes, which might shed light to 


better detection of network changes which deviate from the “global” norm. 


(4 
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Combined measure using both esd and vsd. 


Our research has essentially investigated the effects on esd and vsd separately. We saw 
correlations between the two, esd and vsd behaving similarly for networks, but it is not 
always the case that one is larger than the other. Because of the big difference between 
the esd and vsd for ground truth, we believe that a measure combining the two could be 
more appropriate. Thus, despite operating on different entities, these two measures could 


possibly be combined to give a more wholesome result when comparing networks. 
(5) Comparison of measures to primitives from standard graph theory. 


The primitives mentioned in Section 2.3.1 can be used for comparison with our mea- 
sures. Such a study could reveal possible correlations between conventional graph theory 


metrics and that of our measures. 


6.3 Conclusion 

With the accelerating growth of technology in modern times, connectivity between systems 
(people included) has exploded at an even faster pace. Again, our research points to the fact 
that the Internet really evolves, that new vertices and edges appear all the time, and maybe what 
was known at one point in time is not very relevant at some time later. Much remains to be 


analyzed from this “connectedness.” 


Having seen the application of our measurements on the Internet “organism,” these measure- 
ments could likewise be applied on biological and social networks, due to their similar dynamic 
nature and large size. Our hope is that the proposed measures serve to advance the broader 


community’s ability to compare large graphs of various types. 
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Appendix: SC Analysis Dump Format 











#{ sssssssssssss====5=2====5===555=55555555=5=5=5555555255555=55=5====5555======= 
# This file contains an ASCII representation of the IPv4 paths stored in 

# the binary skitter arts++ and scamper warts file formats. 

# 

# This ASCII file format is in the sk_analysis_dump text output 

# format: imdc.datcat.org/format/1-003W-7 

# 

#{ sssssssssssss====5=====5==5=55555=55555=5=5=5=5555552555=55==5=====5555======= 
# There is one trace per line, with the following tab-separated fields: 

# 

# 

# 1. Key -- Indicates the type of line and determines the meaning of the 

# remaining fields. This will always be ’T’ for an IP trace. 

# 

# ---a-- eon ono sn see Header Fields. ------------------ 


2. Source -- Source IP of skitter/scamper monitor performing the trace. 


3. Destination -- Destination IP being traced. 


4. ListId -- ID of the destination list containing this destination 


address. 
This value will be zero if no list ID was provided. (uint32_t) 
CycleId -- ID of current probing cycle (a cycle is a single run 
through a given list). For skitter traces, cycle IDs 
will be equal to or slightly earlier than the timestamp 
of the first trace in each cycle. There is no standard 
interpretation for scamper cycle IDs. 


This value will be zero if no cycle ID was provided. (uint32_t) 


6. Timestamp -- Timestamp when trace began to this destination. 


# + HH HF H HF H FH H HFK FH HF KH FH HOF FH FH FH OF 
on 


+7 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
Ps) 
® 
o 
hH 
eq 
Hy 
H- 
o 
i 
Q. 
a 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 
| 


# 7. DestReplied -- Whether a response from the destination was received. 
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# + HH HF HH FH HH HF HO HF HO HF FH FH HO HF OH OH OF 


# + + + H FH H HF KH HF HOF OF 


FE 


R - Replied, reply was received 

N - Not-replied, no reply was received; 
Since skitter sends a packet with a TTL of 255 when it halts 
probing, it is still possible for the final destination to 
send a reply and for the HaltReasonData (see below) to not 
equal no_halt. Note: scamper does not perform this last-ditch 
probing at TTL 255. 


8. DestRTT -- RTT (ms) of first response packet from destination. 
0 if DestReplied is N. 


9. RequestTTL -- TTL set in request packet which elicited a response 
(echo reply) from the destination. 


0 if DestReplied is N. 


10. ReplyTTL -- TTL found in reply packet from destination; 
0 if DestReplied is N. 


11. HaltReason -- The reason, if any, why incremental probing stopped. 


12. HaltReasonData -- Extra data about why probing halted. 


HaltReason HaltReasonData 
S (success/no_halt) 0 

U (icmp_unreachable) icmp_code 

L (loop_detected) loop_length 

G (gap_detected) gap_limit 


13. PathComplete -- Whether all hops to destination were found. 
C - Complete, all hops found 
I - Incomplete, at least one hop is missing (i.e., did not 


respond) 


14. PerHopData -- Response data for the first hop. 
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If multiple IP addresses respond at the same hop, response data 


for each IP address are separated by semicolons: 


IP,RTT,numTries (for only one responding IP) 
IP,RTT,numTries;IP,RIT,numTries;... (for multiple responding IPs) 
where 


IP -- IP address which sent a TTL expired packet 
RIT -- RIT of the TTL expired packet 


num_tries -- num tries before response received from TTL. 


This field will have the value ’q’ if there was no response at 


this hop. 

15. PerHopData -- Response data for the second hop in the same format 
as field 14. 

N. PerHopData -- Response data for the destination 


(if destination replied). 


# # # # HH HF HF HH HO HF HF OF OF OH OF OF HOF OF OH OH HF OH OF 
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